Functions and Character Manipulation
1 Functions
When you create a function, it defines a separate environment and the variables you
create inside your function only exist in that function environment; when you return
to where you called the function from, those variables no longer exist.
You can refer to other objects that are in the calling environment, but if you make any
changes to them, the changes will only take place in the function environment.
To get information back to the calling environment, you must pass a return value, which
becomes the value of the function call. R will automatically return the value of the last
expression it evaluates in your function (typically a final expression that isn't assigned
to a variable), or you can place the object you want to return in a
call to the return function. You can only return a single object from a function
in R; if you need to return multiple objects, you need to return a list containing those
objects, and extract them from the list when you return to the calling environment.
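A small sketch may make these rules concrete. The names x, changer and summarize are made up for illustration; the first function shows that a change made inside a function doesn't reach the calling environment, and the second returns two values by wrapping them in a list.

```r
# Hypothetical example: x, changer and summarize are illustrative names only
x = c(1, 2, 3)

changer = function() {
    x[1] = 100      # modifies a local copy of x, not the x in the calling environment
    x               # last expression evaluated, so this is the return value
}

inside = changer()
inside              # 100 2 3
x                   # still 1 2 3

# Returning several objects at once by wrapping them in a list
summarize = function(v) {
    list(smallest = min(v), largest = max(v))
}

res = summarize(x)
res$smallest        # 1
res$largest         # 3
```

The pieces of the returned list are extracted by name with the $ operator once we're back in the calling environment.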
As a simple example of a function that returns a value, suppose we want to calculate the
ratio of the maximum value of a vector to the minimum value of the vector. Here's a
function definition that will do the job:
maxminratio = function(x)max(x)/min(x)
Notice that for a single-line function you don't need to use braces ({}) around the
function body, but you are free to do so if you like. Since the final statement wasn't
assigned to a variable, it will be used as a return value when the function is called.
Alternatively, the value could be placed in a call to the return function.
If we wanted to find the max to min ratio for all the columns of a matrix called mymat,
we could use our function with the apply function:
apply(mymat,2,maxminratio)
The 2 in the call to apply tells it to operate on the
columns of the matrix; a 1 would be used to work on the rows.
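To see the difference between the two settings, here is a sketch using a small made-up matrix; the values in mymat are an assumption chosen so the ratios are easy to check by hand.

```r
maxminratio = function(x) max(x)/min(x)

# mymat is a made-up 3 x 2 matrix, used only for illustration
mymat = matrix(c(1, 2, 4, 10, 20, 50), nrow = 3)

bycol = apply(mymat, 2, maxminratio)   # one ratio per column: 4 5
byrow = apply(mymat, 1, maxminratio)   # one ratio per row: 10 10 12.5
```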
Before we leave this example, it should be pointed out that this function has
a weakness - what if we pass it a vector that has missing values? Since we're
calling min and max without the na.rm=TRUE argument,
we'll always get a missing value if our input data has any missing values.
One way to solve the problem is to just put the na.rm=TRUE argument
into the calls to min and max. A better way would be to
create a new argument with a default value. That way, we still only have to
pass one argument to our function, but we can modify the
na.rm= argument if we need to.
maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)
If you look at the function definitions for functions in
R, you'll see that many of them use this method of setting defaults in the
argument list.
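As a quick check of the default, here is a sketch with a made-up vector v that contains a missing value:

```r
maxminratio = function(x, na.rm = TRUE) max(x, na.rm = na.rm)/min(x, na.rm = na.rm)

v = c(2, NA, 8, 4)                        # made-up data with one missing value

withdefault = maxminratio(v)              # missing value ignored: 8/2 = 4
keepna = maxminratio(v, na.rm = FALSE)    # missing value propagates: NA
```

Callers who are happy with the default never need to mention na.rm at all, while those who want missing values to propagate can override it.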
As another example of a function, recall the graph of income versus literacy with
different colored points
for the different continents.
If we were working with datasets like the world1 dataset
and wanted to create a variety of plots,
we could write a function like this:
worldplotter = function(data,xvar,yvar,cvar,colors,ltitle=cvar,legendloc='topleft'){
colorvar = factor(data[,cvar])
with(data,plot(data[,xvar],data[,yvar],col=colors[colorvar],xlab=xvar,ylab=yvar))
with(data,legend(legendloc,legend=levels(colorvar),col=colors,pch=1,title=ltitle))
}
Now we could produce the income versus literacy graph by calling:
worldplotter(world1,'literacy','income','cont',c('red','blue','green','orange','yellow','violet'))
By changing the arguments, a variety of plots can be produced.
As your functions get longer and more complex, it becomes more difficult to simply type
them into an interactive R session. To make it easy to edit functions, R provides the
edit command, which will open an editor appropriate to your operating system.
When you close the editor,
the edit function will return the edited copy of your function, so it's important
to remember to assign the return value from edit to the function's name.
If you've already defined a function, you can edit it by simply passing it to
edit, as in
maxminratio = edit(maxminratio)
You may also want to consider the fix function, which automates
the process slightly.
To start from scratch, you can use a call to edit like this:
newfunction = edit(function(){})
2 Sizes of Objects
Before we start looking at character manipulation, this is a good time to
review the different functions that give us the size of an object.
- length - returns the length of a vector, or the total number
of elements in a matrix (number of rows times number of columns). For
a data frame, returns the number of columns.
- dim - for matrices and data frames, returns a vector of length
2 containing the number of rows and the number of columns. For a vector,
returns NULL. The convenience functions nrow and ncol
return the individual values that would be returned by dim.
- nchar - for a character string, returns the number of characters
in the string. Returns a vector of values when applied to a vector of
character strings. For a numeric value, nchar returns the number
of characters in the printed representation of the number.
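The following sketch applies each of these functions to a few small made-up objects (v, m and df are illustrative names) so the differences can be seen side by side:

```r
v = c(10, 20, 30)                             # a vector
m = matrix(1:6, nrow = 2)                     # a 2 x 3 matrix
df = data.frame(a = 1:4, b = letters[1:4])    # a data frame with 4 rows, 2 columns

length(v)     # 3
length(m)     # 6 (rows times columns)
length(df)    # 2 (number of columns)

dim(m)        # 2 3
dim(df)       # 4 2
dim(v)        # NULL
nrow(df)      # 4
ncol(df)      # 2

nchar(c('cat', 'elephant'))   # 3 8
nchar(12345)                  # 5
```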
3 Character Manipulation
While it's quite natural to think of data as being numbers, manipulating
character strings is also an important skill when working with data. We've
already seen a few simple examples, such as choosing the right format for a
character variable that represents a date, or using table to tabulate
the occurrences of different character values for a variable. Now we're going
to look at some functions in R that let us break apart, rearrange and put
together character data.
One of the most important uses of character manipulation is "massaging" data
into shape. Many times the data that is available to us, for example on a web
page or as output from another program, isn't
in a form that a program like R can easily
interpret. In cases like that, we'll need to remove the parts that R can't
understand, and organize the remaining parts so that R can read them
efficiently.
Let's take a look at some of the functions that R offers for working with
character variables:
- paste
The paste function converts its arguments to character before operating
on them, so you can pass both numbers and strings to the function. It
concatenates the arguments passed to it, to create new strings that are
combinations of other strings. paste accepts an unlimited number of
unnamed arguments, which will be pasted together, and one or both of the
arguments sep= and collapse=. Depending on whether the
arguments are scalars or vectors, and which of sep= and collapse=
are used, a variety of different tasks can be performed.
- If you pass a single argument to paste, it will return a
character representation:
> paste('cat')
[1] "cat"
> paste(14)
[1] "14"
- If you pass more than one scalar argument to paste, it will put them
together in a single string, using the sep= argument to separate the
pieces:
> paste('stat',133,'assignment')
[1] "stat 133 assignment"
- If you pass a vector of character values to paste, and the
collapse= argument is not NULL, it pastes together the elements
of the vector, using the collapse= argument as a separator:
> paste(c('stat',133,'assignment'),collapse=' ')
[1] "stat 133 assignment"
- If you pass more than one argument to paste, and any of those
arguments is a vector, paste will return a vector as long as its
longest argument, produced by pasting together corresponding pieces of the
arguments. (Remember the recycling rule which will be used if the vector
arguments are of different lengths.) Here are a few examples:
> paste('x',1:10,sep='')
[1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10"
> paste(c('x','y'),1:10,sep='')
[1] "x1" "y2" "x3" "y4" "x5" "y6" "x7" "y8" "x9" "y10"
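The sep= and collapse= arguments can also be used in the same call: paste first joins corresponding pieces using sep=, and then joins the resulting vector into a single string using collapse=. A minimal sketch (the variable names are made up):

```r
# Element-by-element pasting gives a vector of three strings
pieces = paste('x', 1:3, sep = '')                        # "x1" "x2" "x3"

# Adding collapse= joins those three strings into one
joined = paste('x', 1:3, sep = '', collapse = ' + ')      # "x1 + x2 + x3"
```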
- grep
The grep function searches for patterns in text. The first argument
to grep is a text string or regular expression that you're looking
for, and the second argument is usually a vector of character values.
grep returns the indices of those elements of the vector of character
strings that contain the text string. Right now we'll limit ourselves to
simple patterns, but later we'll explore the full strength of commands like
this with regular expressions.
grep can be used in a number of ways. Suppose we want to see the
countries of the world that have the word 'United' in their names.
> grep('United',world1$country)
[1] 144 145
grep returns the indices of the observations that have
'United' in their names. If we wanted to see the values of country
that had 'United' in their names, we can use the value=TRUE argument:
> grep('United',world1$country,value=TRUE)
[1] "United Arab Emirates" "United Kingdom"
Notice that, since the first form of grep returns a vector of indices,
we can use it as a subscript to get all the information about the countries
that have 'United' in their names:
> world1[grep('United',world1$country),]
                 country   gdp income literacy    military cont
144 United Arab Emirates 23200  23818     77.3  1600000000   AS
145       United Kingdom 27700  28938     99.9 42836500000   EU
grep has a few optional arguments, some of which we'll look at later. One
convenient argument is ignore.case=TRUE, which, as the name implies will
look for the pattern we specified without regard to case.
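Since the world1 data may not be at hand, here is a sketch of ignore.case=TRUE using a made-up vector called places (one of its entries is not a real place name):

```r
# places is a made-up vector for illustration only
places = c('United Kingdom', 'united states', 'Reunited Islands', 'France')

exact = grep('united', places)                        # 2 3
anycase = grep('united', places, ignore.case = TRUE)  # 1 2 3
vals = grep('United', places, value = TRUE)           # "United Kingdom"
```

Note that grep matches the pattern anywhere in the string, which is why 'Reunited Islands' is found by the lower-case search.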
- strsplit
strsplit takes a character vector, and breaks each element up into
pieces, based on the value of the split= argument. This argument can
be an ordinary text string, or a regular expression. Since the different
elements of the vector may have different numbers of "pieces", the results
from strsplit are always returned in a list. Here's a simple example:
> mystrings = c('the cat in the hat','green eggs and ham','fox in socks')
> parts = strsplit(mystrings,' ')
> parts
[[1]]
[1] "the" "cat" "in" "the" "hat"
[[2]]
[1] "green" "eggs" "and" "ham"
[[3]]
[1] "fox" "in" "socks"
While we haven't dealt much with lists before, one function that can
be very useful is sapply; you can use sapply to operate on
each element of a list, and it will, if possible, return the result as a vector.
So to find the number of words in each of the character strings in mystrings,
we could use:
> sapply(parts,length)
[1] 5 4 3
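The function passed to sapply doesn't have to be a built-in one. As a sketch, an anonymous function can pull a single piece out of each element of the list (firstwords and lastwords are illustrative names):

```r
mystrings = c('the cat in the hat', 'green eggs and ham', 'fox in socks')
parts = strsplit(mystrings, ' ')

# Each anonymous function receives one character vector of words at a time
firstwords = sapply(parts, function(p) p[1])           # "the" "green" "fox"
lastwords = sapply(parts, function(p) p[length(p)])    # "hat" "ham" "socks"
```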
- substring
The substring function allows you to extract portions of a character
string. Its first argument is a character string, or vector of character strings,
and its second argument is the index (starting with 1) of the beginning of the
desired substring. With no third argument, substring returns the string
starting at the specified index and continuing to the end of the string; if a
third argument is given, it represents the last index of the original string
that will be included in the returned substring. Like many functions in R, its
true value is that it is fully vectorized: you can extract substrings of a vector
of character values in a single call. Here's an example of a simple use of
substring
> strings = c('elephant','aardvark','chicken','dog','duck','frog')
> substring(strings,1,5)
[1] "eleph" "aardv" "chick" "dog" "duck" "frog"
Notice that, when a string is too short to fully meet a substringing
request, no error or warning is raised, and substring returns as much
of the string as is there.
Consider the following example, extracted from a web page. Each element of
the character vector data consists of a city name and a state name followed by
five numbers, laid out in fixed-width columns. Extracting an individual field, say the field with the state
names, is straightforward:
> data = c("Lyndhurst      Ohio          199.02  15,074  30  5   25",
           "Southport Town New York      217.69  11,025  24  4   20",
           "Bedford        Massachusetts 221.20  12,658  28  0   28")
> states = substring(data,16,28)
> states
[1] "Ohio         " "New York     " "Massachusetts"
It is possible to extract all the fields at once, at the cost of
a considerably more complex call to substring:
> starts = c(1,16,30,38,46,50,54)
> ends = c(14,28,35,43,47,50,55)
> ldata = length(data)
> lstarts = length(starts)
> x = substring(data,rep(starts,rep(ldata,lstarts)),rep(ends,rep(ldata,lstarts)))
> matrix(x,ncol=lstarts)
     [,1]             [,2]            [,3]     [,4]     [,5] [,6] [,7]
[1,] "Lyndhurst     " "Ohio         " "199.02" "15,074" "30" "5"  "25"
[2,] "Southport Town" "New York     " "217.69" "11,025" "24" "4"  "20"
[3,] "Bedford       " "Massachusetts" "221.20" "12,658" "28" "0"  "28"
Like many functions in R, substring can appear on the left hand side
of an assignment statement, making it easy to change parts of a character string
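based on the positions they're in. To replace the third and fourth digits of
a set of character strings representing numbers with 99, we could use the following;
the range 3 to 5 is only an upper limit, since no more characters are changed than the
replacement string supplies, and a string is never lengthened: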
> nums = c('12553','73911','842099','203','10')
> substring(nums,3,5) = '99'
> nums
[1] "12993" "73991" "849999" "209" "10"
- tolower, toupper
These functions convert their arguments to all lower-case characters or
all upper-case characters, respectively.
- sub, gsub
These functions change a regular expression or text pattern to a different
set of characters. They differ in that sub only changes the first
occurrence of the specified pattern, while gsub changes all of the
occurrences. Since numeric values in R cannot contain dollar signs or commas,
one important use of gsub is to create numeric variables from text
variables that represent numbers but contain commas or dollar signs. For example,
in gathering the data for the world dataset that we've been using, I extracted
the information about military spending from
http://en.wikipedia.org/wiki/List_of_countries_by_military_expenditures. Here's an
excerpt of some of the values from that page:
> values = c('370,700,000,000','205,326,700,000','67,490,000,000')
> as.numeric(values)
[1] NA NA NA
Warning message:
NAs introduced by coercion
The presence of the commas is preventing R from being able to convert
the values into actual numbers. gsub easily solves the problem:
> as.numeric(gsub(',','',values))
[1] 370700000000 205326700000 67490000000
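The difference between sub and gsub, and the handling of dollar signs, can be sketched with a made-up value. One thing to keep in mind is that the dollar sign has a special meaning in regular expressions, so here it is either matched literally with the fixed=TRUE argument or placed inside a bracketed character class:

```r
amount = '$1,234,567'                               # made-up value for illustration

firstonly = sub(',', '', amount)                    # "$1234,567"  (first comma only)
allcommas = gsub(',', '', amount)                   # "$1234567"   (every comma)

# fixed=TRUE treats the pattern as ordinary text rather than a regular expression
nodollar = gsub('$', '', amount, fixed = TRUE)      # "1,234,567"

# A character class removes dollar signs and commas in one step
asnumber = as.numeric(gsub('[$,]', '', amount))     # 1234567
```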
4 Working with Characters
As you probably noticed when looking at the above functions, they are very simple,
and, quite frankly, it's hard to see how they could really do anything complex on their
own. In fact, that's just the point of these functions - they can be combined together
to do just about anything you would want to do. As an example, consider the task of
capitalizing the first character of each word in a string. The toupper function
can change the case of all the characters in a string, but we'll need to do something
to separate out the characters so we can get the first one. If we call strsplit
with an empty string for the splitting character, we'll get back a vector of the individual
characters:
> str = 'sherlock holmes'
> letters = strsplit(str,'')
> letters
[[1]]
[1] "s" "h" "e" "r" "l" "o" "c" "k" " " "h" "o" "l" "m" "e" "s"
> theletters = letters[[1]]
Notice that strsplit always returns a list. This will be very
useful later, but for now we'll extract the first element before
we try to work with its output.
The places that we'll need to capitalize things are the first position in the vector
of letters, and any letter that comes after a blank. We can find those positions
very easily:
> wh = c(1,which(theletters == ' ') + 1)
> wh
[1] 1 10
We can change the case of the letters whose indexes are in wh,
then use paste to put the string back together.
> theletters[wh] = toupper(theletters[wh])
> paste(theletters,collapse='')
[1] "Sherlock Holmes"
Things have gotten complicated enough that we could probably stand to
write a function:
maketitle = function(txt){
theletters = strsplit(txt,'')[[1]]
wh = c(1,which(theletters == ' ') + 1)
theletters[wh] = toupper(theletters[wh])
paste(theletters,collapse='')
}
Of course, we should always test our functions:
> maketitle('some crazy title')
[1] "Some Crazy Title"
Now suppose we have a vector of strings:
> titls = c('sherlock holmes','avatar','book of eli','up in the air')
We can always hope that we'll get the right answer if we just use
our function:
> maketitle(titls)
[1] "Sherlock Holmes"
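Unfortunately, it didn't work in this case: maketitle extracts only the first
element of the list returned by strsplit, so only the first title is processed.
Whenever that happens, we can use sapply to apply the function to each element of the vector: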
> sapply(titls,maketitle)
  sherlock holmes            avatar       book of eli     up in the air
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"
Of course, this isn't the only way to solve the problem. Rather than break
up the string into individual letters, we can break it up into words, and
capitalize the first letter of each, then combine them back together. Let's
explore that approach:
> str = 'sherlock holmes'
> words = strsplit(str,' ')
> words
[[1]]
[1] "sherlock" "holmes"
Now we can use the assignment form of the substring
function to change the first letter of each word to a capital. Note that
we have to make sure to actually return the modified string from our call
to sapply, so we ensure that the last statement in our function
returns the string:
> sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
  sherlock     holmes
"Sherlock"   "Holmes"
Now we can paste the pieces back together to get our answer:
> res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
> paste(res,collapse=' ')
[1] "Sherlock Holmes"
To operate on a vector of strings, we'll need to incorporate
these steps into a function, and then call sapply:
mktitl = function(str){
words = strsplit(str,' ')
res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
paste(res,collapse=' ')
}
We can test the function, making sure to use a string different than
the one we used in our initial test:
> mktitl('some silly string')
[1] "Some Silly String"
And now we can test it on the vector of strings:
> titls = c('sherlock holmes','avatar','book of eli','up in the air')
> sapply(titls,mktitl)
  sherlock holmes            avatar       book of eli     up in the air
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"
How can we compare the two methods? The R function system.time will
report the amount of time any operation in R uses. One important caveat -
if you wish to assign an expression to a value in the system.time
call, you must use the "<-" assignment operator, because the
equal sign will confuse the function into thinking you're specifying a
named parameter in the function call. Let's try system.time on
our two functions:
> system.time(one <- maketitle(titls))
user system elapsed
0 0 0
> system.time(two <- mktitl(titls))
user system elapsed
0.000 0.000 0.001
For such a tiny example, we can't really trust that the difference
we see is real. Let's use the movie names from a previous example:
> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> nms = tolower(movies$name)
> system.time(one <- maketitle(nms))
user system elapsed
0.000 0.000 0.001
> system.time(two <- mktitl(nms))
user system elapsed
0.008 0.000 0.007
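It looks like the first method is faster than the second, although elapsed times
this small are only suggestive. Keep in mind that both functions were called directly on
the vector, so each processed only its first element; timing sapply(nms,maketitle)
against sapply(nms,mktitl) would compare them on every name. Of
course, if they don't get the same answer, it doesn't really matter how
fast they are. In R, the all.equal function can be used to
see if things are the same: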
> all.equal(one,two)
[1] TRUE
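As a sketch of how the two functions could be compared on every element of a vector, we can wrap each call in sapply inside system.time. The vector nms below is a made-up stand-in for the movie names (the same four titles repeated), so the timings it produces are not the ones shown above and will vary from machine to machine.

```r
maketitle = function(txt){
    theletters = strsplit(txt,'')[[1]]
    wh = c(1,which(theletters == ' ') + 1)
    theletters[wh] = toupper(theletters[wh])
    paste(theletters,collapse='')
}

mktitl = function(str){
    words = strsplit(str,' ')
    res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
    paste(res,collapse=' ')
}

# Made-up stand-in for the movie names: 1000 lower-case titles
nms = rep(c('sherlock holmes','avatar','book of eli','up in the air'),250)

# sapply applies each function to every title, not just the first one
t1 = system.time(one <- sapply(nms,maketitle))
t2 = system.time(two <- sapply(nms,mktitl))

same = all.equal(one,two)     # TRUE when both methods agree on every title
```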
File translated from TeX by TTH, version 3.67, on 8 Feb 2010, 13:59.